SITE MINING STYLESHEET GENERATOR
INVENTOR: DOUGLAS R. JAKUBOWSKI 

FIELD OF THE INVENTION 

The present invention relates generally to stylesheets used for mining 
content from web pages and, more particularly, to the generation of these 
stylesheets using site mining expressions for uniquely locating content to be 
extracted and/or transformed. 

BACKGROUND OF THE INVENTION 

Organizations of all sizes rely on the Internet to conduct business. 
Because of the explosion of mobile enterprise solutions, users of wireless or 
mobile devices are increasingly demanding the delivery of web content for 
viewing on a variety of platforms ranging from desktop computing units to 
wireless portable (e.g., handheld) devices such as personal digital assistants 
(PDAs) and wireless phones. Whether organizations are creating new web 
applications, or extending existing infrastructure, the new Internet-powered
world demands that users have access to this content to remain flexible and 
competitive, and drive stronger customer relationships. 

Currently, the appearance of this content varies greatly depending on 
the platform on which the content is displayed. For example, because of
display and bandwidth limitations, a user utilizing a PDA oftentimes cannot 
access a web page designed for display on a desktop computer, at least not in 
the manner contemplated by the page designer. 

In many cases, certain pieces of content, including memory-intensive
content such as graphics for example, simply do not need to be displayed to
a mobile device user to convey the point of the source page. By displaying
only a selected subset of the information from the source page, content may be
displayed on a particular platform in a manner that meets the requirements of
the requesting device.

A need therefore exists for a technique utilizable for displaying only a 
specific subset of the source page content (i.e., site mining). A need also 
exists for a technique that allows the selected content to be transformed or 
further manipulated before being displayed to the end user. In addition, a need
exists for a technique suitable for generating an expression for uniquely 
locating or identifying the location of content in a page so that the content may 
be extracted and/or manipulated during the site mining process. 

SUMMARY OF THE INVENTION

The present invention addresses the above and other needs of the prior 
art by providing a method, system and medium for generating a site mining 
stylesheet. Generally speaking, site mining stylesheets are utilized to dictate 
the presentation of information or data on, for example, a screen, display or 
some form of medium. In addition, embodiments of the present invention
contemplate that these stylesheets may be utilized for extracting content from
a particular web page. After extraction, this content may be transformed 
and/or manipulated (using the stylesheet) before being displayed on a mobile 
device. 

In use, the stylesheets may be stored on a proxy server or the like and
called when a web page associated with the stylesheet is requested by the 
mobile device. From there, the stylesheet may be applied to the requested web 
page to produce a resultant or destination page, which in turn may be 
transmitted to the requesting mobile device for display. Thus, information or
web pages originally designed for display on one device or medium may be
altered or reformatted with the addition or omission of data before being 
presented on another device. 

More specifically, embodiments of the invention contemplate first 
designing a site mining template utilizable for generating the site mining
stylesheet. Afterwards, the stylesheet may be applied to a source page to
produce a destination page containing any extracted and/or reformatted
content from the original source page. This site mining template may be 
created by receiving and storing format information for formatting a layout of 
the stylesheet. Similarly, an indication of the content to be extracted from the 
source page may also be added to the template. To identify the content, an
expression for uniquely locating each piece of content to be extracted and/or 
manipulated may be determined or generated. In addition to this formatting 
and expression information, transformation information for manipulating the 
content may be included with the template. Once the template has been
completed, it may be converted into the stylesheet and prepared for application
to a corresponding source web page. In this manner, the appearance and 
information presented in a resultant destination page may be customized 
according to the needs and limitations of a particular device and/or user. 

It is to be understood that the invention is not limited in its application
to the details of construction and to the arrangements of the components set
forth in the following description or illustrated in the drawings. The invention 
is capable of being implemented in a number of embodiments and of being 
practiced and carried out in various ways. As such, those skilled in the art will 
appreciate that the conception, upon which this disclosure is based, may
readily be utilized as a basis for the designing of other structures, methods and
systems for carrying out the several purposes of the present invention. It is 
important, therefore, that the claims be regarded as including such equivalent 
constructions insofar as they do not depart from the spirit and scope of the 
present invention. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The Detailed Description will be best understood when read in 
reference to the accompanying figures wherein: 

FIG. 1 is a block diagram representation of an architecture utilizable 
for generating a site mining stylesheet according to embodiments of the
present invention;

FIG. 2 illustrates one example of a flow diagram depicting the 
utilization and generation of a stylesheet of the present invention; 

FIG. 3 is a flow diagram illustrating an exemplary process utilizable 
for generating a site mining stylesheet;

FIG. 4 is a flow diagram illustrating an exemplary process utilizable 
for displaying content contained by a web page; 

FIG. 5 is a flow diagram illustrating an exemplary process utilizable 
for generating a site mining expression for uniquely locating content selected 
from a web page;

FIG. 6 is a flow diagram illustrating an exemplary process utilizable 
for generating a site mining expression for uniquely locating content selected 
from a web page, in which multiple selection criteria are used; 

FIG. 7 is a flow diagram illustrating an exemplary process for 
converting a template into a site mining stylesheet of the present invention;

FIG. 8 illustrates one example of a central processing unit utilizable for 
implementing a computer process of the present invention; and 

FIG. 9 illustrates one example of a block diagram of internal hardware 
of the central processing unit of FIG. 8. 

DETAILED DESCRIPTION OF THE INVENTION 



FIG. 1 illustrates an architecture utilizable for implementing various 
aspects of the present invention. In the embodiment depicted in FIG. 1, a 
proxy server 100 is linked to and utilizable with a developer workstation 150,
server 160, and mobile device 170. Examples of proxy server 100 and server
160 include any of a number of mainframe and/or personal computing devices 
such as those utilizing Enterprise System Architecture/370 offered by 
International Business Machines Corporation of Armonk, New York. 
Examples of mobile device 170 may include any of a number of handheld
devices such as those offered by Palm, Inc. of Santa Clara, California
including, for example, Palm VII devices. In addition, other examples of
mobile device 170 may include any of a number of different types of 
computers, such as those having Pentium™-based processors manufactured by 
Intel Corporation of Santa Clara, California. Developer workstation 150, on
the other hand, may exist as any of the computing devices discussed above. In 
addition, developer workstation 150 may also be implemented as one or more 
computing routines processing on proxy server 100. In this manner,
embodiments of the present invention contemplate that developer workstation
150 may be utilized in conjunction with proxy server 100 to generate site
mining stylesheets, which, upon request from mobile device 170, may be 
applied to web pages maintained in server 160 to deliver customized content 
to mobile device 170. 

Referring again to FIG. 1, proxy server 100, mobile device 170, server
160, and in some embodiments developer workstation 150, may be linked or
interconnected with one another via one or more data communication
networks 180. Examples of these data communication networks include
hard-wired and wireless LANs/WANs, direct dial-up lines, the Internet, intranets,
and the like. Thus, data or content contained in web sites or web pages 162
maintained on server 160 may be accessed via, for example, the Internet by
mobile device 170. In particular, a browser process 172 implemented on
mobile device 170 may be utilized to formulate a search request or query with,
for instance, a search function or a Uniform Resource Locator (URL). In
turn, this request or query is received by server 160 (e.g., via the Hypertext
Transfer Protocol (HTTP) or the like), which in response transmits the web
page containing the requested data to mobile device 170. From there, browser
172 processes the results and displays the originally requested web page
content. Typically, the web page is embodied in a HyperText Markup
Language (HTML), Extensible Markup Language (XML), Wireless Markup
Language (WML) file, or the like, and may include other component files such
as sound (e.g., .wav) or graphics (e.g., .gif) files or the like. With markup
language examples, embodiments of the present invention contemplate that
content may include HTML or XML tags and any data or information located
within, or delineated by, the beginning and ending tags of a tag instance.
Furthermore, although only one mobile device 170 and server 160 are shown
in the example of FIG. 1, it is to be understood that any number of mobile
devices and servers may be utilized in accordance with the concepts of the
present invention.

As contemplated by embodiments of the present invention, proxy
server 100 facilitates the generation of site mining stylesheets, which, when
applied to a source web page, may be used to manipulate and/or customize
selected content retrieved from the source web page. These stylesheets
contain various rules and/or instructions for transforming the presentation or
structure of a web page or other document, and may include programs for
formatting web pages as well as commands for transforming other information
such as magazines and newsprint. In addition, proxy server 100 also
facilitates the subsequent transmission of this new content to a requesting
mobile device 170. Thus, these stylesheets may be used to describe how a
source page is to be presented on a destination device. By applying a
stylesheet to a source page, the source layout and presentation of the source
page may be transformed and/or manipulated without sacrificing
device-independence. As will be discussed below, embodiments of the present
invention contemplate that the generation of these stylesheets may be
facilitated through use of one or more site mining templates.

Referring again to FIG. 1, a graphical user interface (GUI) 152
operating on or in conjunction with developer workstation 150 may be utilized
to generate any number of site mining templates 104. As one example, these
templates 104 may be written in XML and may be stored in memory
accessible by proxy server 100. Templates 104 generally identify the content
from a source page (e.g., one of web pages 162) that is to be displayed or
manipulated, as well as how the content is to be displayed and/or manipulated.
For example, the location of a particular piece of content may be identified
within a template by one or more site mining expressions. These site mining
expressions are typically utilized by the stylesheet to locate the content to be
retrieved and may include, for example, an XPath or Document Object Model
(DOM) expression. In this regard, XPath may be characterized as a language
or string syntax for addressing or building addresses to specific parts of a web
page (typically written in XML). Thus, an XPath or other similar expression
may be used to specify the location of a document structure or content found
in a web page when processing that information.

To customize the appearance and layout of a resultant destination page,
template 104 may also be utilized to display (i.e., add) content not extracted
from the source page. To manipulate or otherwise transform content
selected from the source page, embodiments of the present invention
contemplate that a number of custom tags may be included within template
104. As will be discussed below, these custom tags may include rules or
command tags used to control processing of template 104, transformation tags
used to manipulate the content, or any other similar tags and the like. After
the site mining templates 104 have been generated, they may be converted by
a compiler process 108 to produce a number of stylesheets. As one example,
the stylesheet produced after compilation may be embodied in Extensible
Stylesheet Language Transformations (XSLT) code or some other similar
rendering vocabulary for describing the semantics of formatting information.

To utilize a stylesheet generated in the above manner, a search request
or query is received from mobile device 170 in the form of, for example, a
decorated URL or proxy request. With a proxy request, a browser may be
configured to send queries to, for example, a specific HTTP port on a specific
proxy server. The proxy server listens on this port, which may be separate
from the port used by a web server, and processes all queries it receives. In
other cases, web page references and links contained in a destination page
point to a proxy server's web page. The proxy web page accepts the desired
web page as a parameter, and in this manner the proxy web page is
"decorated" with a desired URL. In any event, the request is received by a
proxy process 120 implemented in proxy server 100. Generally speaking,
proxy process 120 acts as an intermediary between the device browser 172 and
the server 160, and is responsible, in part, for receiving HTTP requests and
transforming the requests into other formats. In this example, after receiving a
request from mobile device 170, proxy process 120 calls an engine 116, which,
in turn, applies a stylesheet corresponding to the requested web page. In this
regard, engine 116 retrieves the content specified by the stylesheet from a
source web page 162. From there, any transformations are performed by
engine 116 before producing a "destination page", which is then transmitted to
a browser process 172 on the requesting mobile device 170. In this manner,
selected content mined from a source page may be displayed in a customized
format on a requesting mobile device.
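
The "decorated" URL arrangement described above can be sketched as follows. This is only an illustration: the proxy page address and the parameter name `url` are hypothetical, since the description does not specify either.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical proxy page address and parameter name (not given in the description).
PROXY_PAGE = "http://proxy.example.com/mine"
TARGET_PARAM = "url"


def decorate(target_url: str) -> str:
    """Wrap a target URL so the request is routed through the proxy web page."""
    return PROXY_PAGE + "?" + urlencode({TARGET_PARAM: target_url})


def undecorate(request_url: str) -> str:
    """Recover the originally requested URL from a decorated request."""
    query = parse_qs(urlsplit(request_url).query)
    values = query.get(TARGET_PARAM)
    if not values:
        raise ValueError("request carries no target URL")
    return values[0]


decorated = decorate("http://www.example.com/news/index.html?page=2")
print(decorated)
print(undecorate(decorated))
```

Links rewritten this way keep every follow-up request flowing through the proxy page, which can then look up the stylesheet associated with the recovered URL.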

Referring to FIG. 2 (in conjunction with FIG. 1), one example of a
process utilizable for generating and implementing a stylesheet of the present
invention is described. Initially, a web page 162 requested by mobile device
170 is retrieved by proxy server 100 from server 160 (step 202).
Subsequently, a corresponding stylesheet maintained on proxy server 100 is
applied to web page 162 by engine 116 (step 206). As described above, the
stylesheet may identify any number of pieces of content to be extracted from
web page 162, via, for example, one or more site mining expressions (e.g., via
XPath expressions or the like). Upon applying the stylesheet to web page 162,
each identified piece of content may be extracted from the original source
page (step 210). From there, each piece of content may be manipulated or
otherwise transformed and inserted into a new or destination document along
with any additional information specified by the stylesheet (step 214). This
resulting destination document or web page may then be transmitted to the
requesting remote device 170 (step 218).
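
The extraction and transformation steps of FIG. 2 (steps 206 through 214) can be sketched in a few lines. This is not the engine described here: the Python standard library has no XSLT processor, so the "stylesheet" is reduced to a list of path expressions plus a transformation function, and the names `mine` and `uppercase_cells` are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

SOURCE_PAGE = """<html><body>
<h1>Table 1</h1>
<table>
<tr><td>This is text for row one, table one</td></tr>
<tr><td>This is text for row two, table one</td></tr>
</table>
</body></html>"""


def mine(source_xml, expressions, transform):
    """Extract content by expression, transform it, and build a destination page."""
    source = ET.fromstring(source_xml)
    destination = ET.Element("html")
    body = ET.SubElement(destination, "body")
    for expression in expressions:
        for element in source.findall(expression):  # step 210: extract
            body.append(transform(element))         # step 214: transform and insert
    return ET.tostring(destination, encoding="unicode")


def uppercase_cells(row):
    """Hypothetical transformation: copy a row with its cell text upper-cased."""
    clone = copy.deepcopy(row)
    for cell in clone.iter("td"):
        cell.text = (cell.text or "").upper()
    return clone


# ElementTree paths are relative to the root element, so the XPath
# "/html/body/table[1]/tr" is written here as "./body/table[1]/tr".
page = mine(SOURCE_PAGE, ["./body/table[1]/tr"], uppercase_cells)
print(page)
```

The heading is left behind because no expression selects it, which is the sense in which only a subset of the source page reaches the device.
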
As mentioned above, embodiments of the present invention provide a
mechanism for selecting content from an original web page and for creating a
site-mining template in which manipulations and formatting of the selected
content may be performed. The syntax of the site-mining template is typically
tag-based, utilizing any number of standard tags (including those offered with
HTML, XML, or the like) as well as a variety of custom tags for manipulating
the content. These custom tags may be implemented to provide any number of
basic programming constructs including variables, looping, conditional and
output statements, or the like. In addition, links (i.e., site mining expressions)
to the content to be extracted may be placed at any number of desired
locations within the template. In particular, an expression for locating content
to be extracted may be determined or generated utilizing, for example, a visual
process via a graphical user interface or the like. As will be discussed below,
upon completing this process, a unique expression is generated, which may
then be added to the template. From there, the template may be converted or
compiled to produce, for example, an XSLT stylesheet. Then, when the
original web page is requested by a mobile device, this stylesheet may be
applied to the requested page to produce a new destination page containing
content that has been reformatted or manipulated to meet the specific
requirements of the requesting mobile device.
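
To make the tag-based syntax concrete, the sketch below parses a small template that mixes standard tags with custom tags. The tag and attribute names (as-query, as-foreach, as-output) come from the custom tag table in this description; how they are arranged inside a template is an assumption made for illustration, as is the helper `list_custom_tags`.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: the element layout is assumed, not taken from the description.
TEMPLATE = """<html>
  <body>
    <h1>Headlines</h1>
    <as-query name="rows" pattern="/html/body/table[1]/tr"/>
    <as-foreach name="row" query="rows">
      <as-output select="row"/>
    </as-foreach>
  </body>
</html>"""

CUSTOM_PREFIX = "as-"


def list_custom_tags(template_xml):
    """Return (tag, attributes) pairs for every custom tag, in document order."""
    root = ET.fromstring(template_xml)
    return [(el.tag, dict(el.attrib))
            for el in root.iter() if el.tag.startswith(CUSTOM_PREFIX)]


for tag, attributes in list_custom_tags(TEMPLATE):
    print(tag, attributes)
```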

One example of a process utilizable for generating a site-mining
stylesheet of the present invention is described with reference to FIG. 3.
Initially, the source web page (i.e., the HTML or XML page to be site mined)
is specified and retrieved by a developer (step 304). For instance, the URL of
the document to be site mined may be entered at developer workstation 150
via a graphical user interface (GUI), or the like. The specified page is
retrieved and, if not already in compliance with XML, may be converted to
XHTML or some other XML-compliant format. In this regard, any number of
software packages, such as Tidy offered by W3C, may be utilized to convert
HTML documents to XML-compliant form.
Subsequently, a site mining template is generated for mining the
source page. Specifically, working from, for example, developer workstation
150, the format or layout of the destination page may be designed using any
combination of HTML or XML tags or the like (step 308). For instance, a
developer may add any number of banners, determine header/footer
settings/content, set margins, create new tables, add custom text or graphics
and the like. In addition, embodiments of the present invention contemplate
that any number of pieces of content may be selected for extraction, or mined,
from the source page and included in some form in the template (step 312).
As will be discussed below, for each piece of content to be extracted, an
expression uniquely identifying or locating the content is generated.
Embodiments of the present invention contemplate that these expressions may
be embodied as XPath or DOM syntax expressions or any other site mining
expressions utilizable for locating content in a web page written in an
extensible markup language such as XML or the like. Other examples of
extensible markup languages include Math Markup Language (MATHML),
Bioinformatic Sequence Markup Language (BSML), Instrumentation Markup
Language (IML), Chemical Markup Language (CML), Wireless Markup
Language (WML), Astronomical Instrumentation Markup Language (AIML),
and other similar markup languages. Furthermore, any number of custom tags
may also be included in the template at this point (step 316). These tags may
be utilized to manipulate or transform the extracted content as well as control
the processing flow of the template. For instance, any number of command
tags, such as loops or if-then tags, may be included to control flow during
template processing. Likewise, any number of rules tags may be included to
transform or otherwise manipulate selected content. Some examples of these
transformations and manipulations include string or graphics replacement,
string or graphics formatting, appending data to strings or graphics, reading
data and performing additional functions, arithmetic/mathematical
manipulations such as rounding, max/min, counting, summations and/or other
similar manipulations. In this manner, the custom tags provide programming
capability in the stylesheet.

During the template generation process, any format information, 
custom tags, and site mining expressions may be stored or saved to memory 
(e.g., in a .asl file) implemented in or accessible by proxy server 100. 
Examples of other possible custom tags include:

Custom Tag      Attributes         Comment
--------------  -----------------  ---------------------------------------------------

Storage Tags

as-variable     name, pattern      <pattern> is the unique XPath expression for
                                   content to extract. <name> is the name of the
                                   variable where the content will be copied.

as-query        name, pattern      <pattern> is the unique XPath expression for
                                   content to link. <name> is the name of the query
                                   where the content will be linked.

Decision Tags

as-if           condition          <condition> is a valid XPath expression. If the
                                   expression is evaluated to True, then the content
                                   contained in the tag is copied to the stylesheet.

Looping Tags

as-foreach      name, query        <query> is an XPath expression for content or a
                                   query defined by as-query. The tag causes the
                                   nested tags it contains to be executed for each
                                   element that <query> points to. The value of the
                                   current iteration is stored in a query named <name>.

Search Tags

as-find         name, select,      <select> may be a variable, query, or XPath link to
                pattern            content. <pattern> is a valid XPath expression.
                                   The tag searches the content identified by <select>
                                   for elements that match the <pattern> expression.
                                   The results of the search are stored in a variable
                                   named <name>.

Rule Tags

as-applyrules   name, query        <query> is an XPath expression for content or a
                                   query defined by as-query. This tag copies the
                                   content specified by <query> to a variable named
                                   <name>. The rules contained by the tag are applied
                                   while the content is being copied.

as-removeattr   pattern            <pattern> is a valid XPath expression. Removes all
                                   attributes that match this expression during the
                                   content copying.

as-editattr     pattern,           <pattern> is a valid XPath expression. This tag
                [value, scale,     edits all attributes that match this expression during
                min, max]          the content copying. The attribute's value is set if
                                   <value> is specified. Otherwise the attribute's
                                   value is scaled and checked against the minimum and
                                   maximum boundaries.

Output Tags

as-output       select             <select> may be a variable, query, or XPath link to
                                   content. Performs a deep copy of the <select>'s
                                   content to the stylesheet.

as-text                            Copies the content contained within the tag to the
                                   stylesheet.

Function Tags

as-function     name               Defines a function named <name>. Can be
                                   immediately followed by zero or more as-parameter
                                   tags.

as-parameter    name               Defines a parameter for the as-function tag.

as-callfunc     name               Calls a function defined by as-function, with the
                                   name <name>. Can be immediately followed by
                                   zero or more as-callparam tags.

as-callparam    name, select       Passes a parameter with name <name> to a function
                                   defined by as-function. The value of <select> may
                                   be a variable, query, or XPath link to content.



The exemplary custom tags listed above are utilizable in conjunction
with XPath expression syntax. Other custom tags beyond those listed above
may also be implemented. Furthermore, it is to be
understood that other types of site-mining expressions in addition to XPath
expressions are also utilizable without departing from the scope of the present
invention. Although the above examples describe the application of a
stylesheet to a single web page, it is to be understood that embodiments of the
present invention also contemplate the application of a stylesheet to any
number of web pages conforming to certain specifications.
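
As one reading of the as-editattr entry in the table above, the sketch below sets the attribute when a value is given and otherwise scales the current value and clamps it to the boundaries. The function name `edit_attribute`, the treatment of the attribute as a number, and the rounding to an integer are all assumptions; the description only states that the value is scaled and checked against the minimum and maximum.

```python
def edit_attribute(current, value=None, scale=1.0, minimum=None, maximum=None):
    """Return the new attribute value: an explicit value wins, else scale and clamp."""
    if value is not None:
        return value
    scaled = float(current) * scale       # assumes a numeric attribute such as width
    if minimum is not None:
        scaled = max(scaled, minimum)     # enforce the lower boundary
    if maximum is not None:
        scaled = min(scaled, maximum)     # enforce the upper boundary
    return str(int(round(scaled)))        # assumed: attributes are written back as integers


# A 640-pixel-wide image scaled to a quarter size, but never wider than 150 pixels.
print(edit_attribute("640", scale=0.25, minimum=100, maximum=150))
```

A rule of this kind is one way a desktop-sized image or table width could be brought within the limits of a handheld display.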

Referring again to FIG. 3, after the site-mining template has been
designed, the template may be converted into a stylesheet by compiler 108
(step 320). Embodiments of the present invention contemplate that the end
result of the conversion process may be an XSLT stylesheet in which any
custom tags have been converted into an XSLT format (e.g., a .xsl file),
although other similar types of stylesheets (e.g., cascading stylesheets) are also
possible. As discussed above, after conversion, the stylesheet exists in a
format readable and implementable by engine 116 to mine content from a
source web page maintained on server 160. In addition, embodiments of the
present invention contemplate the usage of XSLT stylesheets for the
conversion of the site mining template. In this regard, since the site mining
template may be XML compliant, an XSLT stylesheet may act as the compiler,
which converts the site mining template into a site mining stylesheet. One
example of such a conversion procedure, utilizing a two-pass process, will be
discussed in greater detail below.
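
The conversion of custom tags into XSLT can be illustrated with a small stand-in compiler. It is written in Python rather than as an XSLT stylesheet, it makes a single pass, and the mapping in `TAG_MAP` (as-if to xsl:if, as-foreach to xsl:for-each, as-output to xsl:copy-of) is an assumption; the description does not state which XSLT instruction each custom tag becomes.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL_NS)

# Assumed mapping: custom tag -> (XSLT element, source attribute, target attribute).
TAG_MAP = {
    "as-if": ("if", "condition", "test"),
    "as-foreach": ("for-each", "query", "select"),
    "as-output": ("copy-of", "select", "select"),
}


def convert(node):
    """Recursively copy a template node, replacing mapped custom tags with XSLT elements."""
    if node.tag in TAG_MAP:
        xsl_name, source_attr, target_attr = TAG_MAP[node.tag]
        # Other attributes (e.g. the name on as-foreach) are ignored in this sketch.
        out = ET.Element("{%s}%s" % (XSL_NS, xsl_name),
                         {target_attr: node.get(source_attr, "")})
    else:
        out = ET.Element(node.tag, dict(node.attrib))
    out.text, out.tail = node.text, node.tail
    for child in node:
        out.append(convert(child))
    return out


def compile_template(template_xml):
    """Wrap the converted template in a stylesheet with a single root-matching rule."""
    stylesheet = ET.Element("{%s}stylesheet" % XSL_NS, {"version": "1.0"})
    rule = ET.SubElement(stylesheet, "{%s}template" % XSL_NS, {"match": "/"})
    rule.append(convert(ET.fromstring(template_xml)))
    return ET.tostring(stylesheet, encoding="unicode")


TEMPLATE = """<html><body>
<as-foreach name="row" query="/html/body/table[1]/tr">
<as-if condition="td"><as-output select="."/></as-if>
</as-foreach>
</body></html>"""

xslt = compile_template(TEMPLATE)
print(xslt)
```

Standard tags pass through unchanged as literal result elements, so the layout designed in the template carries over to the destination page.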

Examples of a number of processes utilizable for generating
site-mining expressions utilizable for uniquely locating or identifying web page
content are now described. In this regard, embodiments of the present
invention contemplate that any combination of these processes may be used to
generate the expressions utilized in step 312 of the stylesheet generation
process discussed above. Referring now to FIG. 4, one example of a process
utilizable for displaying one or more pieces of content contained by a web
page is depicted. To commence processing, a web page containing the content
at issue may be identified or specified using, for example, GUI 152
implemented on developer workstation 150. For instance, a developer may
enter a URL specifying a web page from which content is to be extracted. The
web page specified by the developer may then be retrieved (step 402).
Embodiments of the present invention contemplate that the web page may be
written in HTML, XML, or other similar languages. This being the case, the
retrieved web page is then examined to determine its format (step 406). If the
web page is embodied in a data or text-based format such as XML, the page
may be parsed with its hierarchy of elements (i.e., content) displayed in, for
example, a tree view (step 422). While displayed in this tree view, each piece
of content may be displayed in relation to each of the other pieces of content
contained in the page.
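
A text-only approximation of the tree view of step 422 is sketched below. The actual view is a graphical component of GUI 152; the function name `tree_view` and the indented-text rendering are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def tree_view(element, depth=0):
    """Return indented lines, one per element, showing the page hierarchy."""
    text = (element.text or "").strip()
    label = element.tag + (": " + text if text else "")
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(tree_view(child, depth + 1))
    return lines


PAGE = "<html><body><h1>Table 1</h1><table><tr><td>row one</td></tr></table></body></html>"
print("\n".join(tree_view(ET.fromstring(PAGE))))
```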

If, on the other hand, the web page is embodied in HTML or some
other graphics-based format, the page is converted into an XML compatible
format such as XHTML (step 410). As mentioned above, any number of
software packages, such as Tidy offered by W3C, may be utilized to convert
HTML documents to XML-compliant form. Once the page has been
converted into an XML-compliant form, the web page relative links are
converted to an explicit path or absolute links (step 414). This may be
accomplished using any number of procedures, one example of which is
discussed in the Internet Engineering Task Force RFC 1808. Subsequently,
the now XML compatible web page may be displayed (step 418) along with its
hierarchy of elements in, for example, a tree view (step 422).
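
The link conversion of step 414 can be sketched with Python's `urllib.parse.urljoin`. Note that `urljoin` follows the later RFC 3986 resolution rules rather than RFC 1808 itself, and that the set of link-bearing attributes handled here (`href` and `src`) is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

LINK_ATTRIBUTES = ("href", "src")  # assumed set of link-bearing attributes


def absolutize_links(page_xml, base_url):
    """Rewrite relative href/src attributes as absolute URLs against base_url."""
    root = ET.fromstring(page_xml)
    for element in root.iter():
        for name in LINK_ATTRIBUTES:
            if name in element.attrib:
                element.set(name, urljoin(base_url, element.get(name)))
    return ET.tostring(root, encoding="unicode")


PAGE = '<html><body><a href="../news/today.html">News</a><img src="logo.gif" /></body></html>'
converted = absolutize_links(PAGE, "http://www.example.com/sports/index.html")
print(converted)
```

Making links absolute matters because the destination page is served from the proxy, where a relative link would otherwise resolve against the wrong host.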

FIG. 5 depicts one example of a process utilizable for generating an
expression for uniquely locating content selected from a web page. First, web
page content may be selected from, for example, the tree view displayed as per
step 422 (step 502). As an example, a developer may select content from a
tree view displayed in GUI 152 by left or right clicking with a mouse on the
element. For instance, the present invention may be implemented in a manner
which allows the selection of a single piece of content with a left click. In a
similar manner, right clicking may be arranged to allow more complex
selections such as selecting each similarly named sibling (i.e., each element
residing at a particular level in the tree); each similarly named piece of content
in the page; each sibling element; or applying additional filtering criteria (e.g.,
content that contains specific text). Thus, the present invention may be
utilized to select each piece of content identified by an HTML "TABLE" tag
residing in a particular level. As such, if filtering is desired (step 506), the
desired filtering criteria may be added to the selected element (step 510) by,
for example, right clicking on the content and entering the criteria in a pop-up
window, pulldown menu, or other user interface.
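
The selection modes just described can be mapped onto XPath expressions roughly as follows. The mode names, the function `expression_for`, and the use of a `contains()` predicate for the text filter are assumptions made for illustration; the filter text is also assumed not to contain a single quote.

```python
import re


def expression_for(path, mode="single", text=None):
    """Derive a site mining expression from an indexed path and a selection mode."""
    parent, last_step = path.rsplit("/", 1)
    last_name = re.sub(r"\[\d+\]$", "", last_step)  # drop a trailing index such as [1]
    if mode == "single":
        expression = path                            # one specific tag instance
    elif mode == "same-name-siblings":
        expression = parent + "/" + last_name        # every similarly named sibling
    elif mode == "same-name-anywhere":
        expression = "//" + last_name                # every similarly named element in the page
    elif mode == "all-siblings":
        expression = parent + "/*"                   # every sibling regardless of name
    else:
        raise ValueError("unknown selection mode: " + mode)
    if text is not None:
        expression += "[contains(., '%s')]" % text   # assumed form of the text filter
    return expression


selected = "/html/body/table[1]/tr[1]"
for mode in ("single", "same-name-siblings", "same-name-anywhere", "all-siblings"):
    print(mode, "->", expression_for(selected, mode))
```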

Referring again to FIG. 5, embodiments of the present invention
contemplate that the selection of a particular piece of content may result in the
updated display of any graphical components associated with the content.
Thus, before any content is displayed, the format of the web page may be
examined to determine whether any graphical components exist, by, for
example, determining whether the document is in a graphics based or HTML
format (step 514). If no graphical components exist in the page, the unique
expression for the selected content is determined (discussed below) and
displayed (step 522). If the page includes graphical components, the selected
content may be displayed on GUI 152 (step 518) before determining and
displaying the unique expression corresponding to the selected content (step
522). Thus, any HTML coded content may be displayed by utilizing, for
example, a stylesheet or some other similar mechanism, to copy selected
content and any child elements or text to GUI 152.

To generate an expression for locating content within a page,
embodiments of the present invention contemplate identifying a unique
expression for each piece of desired content within the page. Since pages are
sets of content (tags) nested within one another in a hierarchical manner, it is
contemplated that this unique expression may be derived from the
concatenation of expressions created by drilling down through this hierarchy.
As discussed above, the expression basically specifies a path to a piece of
content. Although embodiments of the present invention contemplate that the
expression may be embodied as an XPath expression, other formats may also
be utilized, including, for example, a DOM expression.
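
The drilling-down idea can be illustrated with a minimal Python sketch. It is
not taken from the specification: the function name unique_expression, the
sample page, and the choice of the standard xml.etree.ElementTree parser are
assumptions made for illustration only. The sketch walks from a selected
element up to the root and concatenates one indexed step per level.

```python
import xml.etree.ElementTree as ET

# Small well-formed sample page (an assumption for this sketch).
PAGE = """<html><body>
<h1>Table 1</h1>
<table><tr><td>row one</td></tr><tr><td>row two</td></tr></table>
<h1>Table 2</h1>
<table><tr><td>row one</td></tr><tr><td>row two</td></tr></table>
</body></html>"""


def unique_expression(root, target):
    """Return an indexed path from the document root down to `target`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # Position among same-named siblings (XPath positions start at 1).
        same_name = [c for c in parent if c.tag == node.tag]
        position = next(i for i, c in enumerate(same_name, 1) if c is node)
        steps.append(f"{node.tag}[{position}]")
        node = parent
    steps.append(root.tag)
    # Concatenate the per-level steps, outermost first.
    return "/" + "/".join(reversed(steps))


root = ET.fromstring(PAGE)
# Simulate a user selecting the first row of the second table.
selected = root.findall("body/table")[1].findall("tr")[0]
print(unique_expression(root, selected))  # /html/body[1]/table[2]/tr[1]
```

Every step below the root carries an explicit index, so the resulting
expression identifies exactly one element for as long as the page layout
stays the same.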

Several examples illustrating the generation of an XPath expression
are now described using the following as an original document:

<html>
<body>
<h1>Table 1</h1>
<table>
<tr><td>This is text for row one, table one</td></tr>
<tr><td>This is text for row two, table one</td></tr>
</table>
<h1>Table 2</h1>
<table>
<tr><td>This is text for row one, table two</td></tr>
<tr><td>This is text for row two, table two</td></tr>
</table>
</body>
</html>



Example 1: Selecting a specific tag instance (using an index)
XPath Expression -
/html/body/table[1]/tr[1] ==> Select the first row in the first table
Corresponding Content -
<tr><td>This is text for row one, table one</td></tr>

Example 2: Select all sibling tags that have the same name
XPath Expression -
/html/body/table[1]/tr ==> Select all rows in the first table
Corresponding Content -
<tr><td>This is text for row one, table one</td></tr>
<tr><td>This is text for row two, table one</td></tr>

Example 3: Select all tags in the document that have the same name
XPath Expression -
//tr ==> Select all rows
Corresponding Content -
<tr><td>This is text for row one, table one</td></tr>
<tr><td>This is text for row two, table one</td></tr>
<tr><td>This is text for row one, table two</td></tr>
<tr><td>This is text for row two, table two</td></tr>

Example 4: Select all child elements regardless of name
XPath Expression -
/html/body/* ==> Get all children of the html body
Corresponding Content -
<h1>Table 1</h1>
<table>
<tr><td>This is text for row one, table one</td></tr>
<tr><td>This is text for row two, table one</td></tr>
</table>
<h1>Table 2</h1>
<table>
<tr><td>This is text for row one, table two</td></tr>
<tr><td>This is text for row two, table two</td></tr>
</table>

Applicable to all cases: Using expression filtering
XPath Expression -
//td[contains(text(), "table one")] ==> Get all table cells in the
document that contain the text 'table one'
Corresponding Content -
<td>This is text for row one, table one</td>
<td>This is text for row two, table one</td>
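
The examples above can be checked with a short Python sketch. It assumes the
standard xml.etree.ElementTree module, which supports only a subset of XPath:
paths are written relative to the root html element (./body/... rather than
/html/body/...), and the contains() filter is not supported, so the filtering
case is emulated with an ordinary Python test. The helper cell_texts is a
hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<h1>Table 1</h1>
<table>
<tr><td>This is text for row one, table one</td></tr>
<tr><td>This is text for row two, table one</td></tr>
</table>
<h1>Table 2</h1>
<table>
<tr><td>This is text for row one, table two</td></tr>
<tr><td>This is text for row two, table two</td></tr>
</table>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element


def cell_texts(rows):
    """Return the text of the first cell of each row."""
    return [row.find("td").text for row in rows]


# Example 1: /html/body/table[1]/tr[1] (first row of the first table)
first_row = root.findall("./body/table[1]/tr[1]")

# Example 2: /html/body/table[1]/tr (all rows of the first table)
rows_table_one = root.findall("./body/table[1]/tr")

# Example 3: //tr (all rows in the document)
all_rows = root.findall(".//tr")

# Example 4: /html/body/* (all children of the body, regardless of name)
body_children = root.findall("./body/*")

# Filtering: //td[contains(text(), "table one")], emulated in Python
# because ElementTree does not implement contains().
matching_cells = [td for td in root.iter("td") if "table one" in (td.text or "")]

print(cell_texts(first_row))
print(len(rows_table_one), len(all_rows))
print([child.tag for child in body_children])
print([td.text for td in matching_cells])
```

A full XPath 1.0 engine would accept the expressions exactly as written in
the examples; the relative form is used here only to stay within the Python
standard library.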



An example of a process for generating an expression for uniquely
locating content selected from a web page, in which multiple selection criteria
are utilized, is described with reference to FIG. 6. In this compound selection
example, content may be extracted based upon the concatenation or
combination of a plurality of site mining expressions. Initially, this process
starts by indicating that a currently selected piece of content is to be the root
element of a new document or page to be searched (step 602). This new
document is created and processed according to the process described in FIG.
4, resulting in the display of its pieces of content in tree view (step 606).
Subsequently, a piece of content from this new document or page may be
selected and processed according to the process described in FIG. 5 to produce
an expression locating the selected content within the new page (step 610). To
generate a final expression, the expression used to locate the content within
the new page is appended to or concatenated with the expression of the current
page selected in step 602 (step 614). This process may be repeated as many
times as desired (step 618). Thus, utilizing the process of FIG. 6, an
expression may be generated to locate content which may move from place to
place within a single structure. For example, the process of FIG. 6 may be
used to generate an expression for locating a particular best-selling book
within a best-selling book table (listed and updated according to the number of
sales each week) by first selecting the table as the current selection, and then
by entering the name of the book as additional filtering criteria. Hence,
although the position of the table may shift within the document and the
position of the book may shift within the table from week-to-week, an
expression for locating the selected content (the book) may still be generated.

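
The best-selling book example can be sketched in Python as follows. This is
an illustrative assumption rather than the specification's implementation:
the two sample pages, the id="bestsellers" attribute used to find the table,
the book titles, and the helper name locate are all hypothetical. The sketch
concatenates an expression for the table with an expression for the book row
inside it, and shows that the result still finds the book after both have
moved.

```python
import xml.etree.ElementTree as ET

# Two weekly snapshots (hypothetical data): the table moves within the
# page and "Book B" moves within the table.
WEEK_1 = """<html><body>
<table id="bestsellers">
<tr><td>Book A</td><td>1</td></tr>
<tr><td>Book B</td><td>2</td></tr>
</table>
</body></html>"""

WEEK_2 = """<html><body>
<p>New banner pushed the table down</p>
<div><table id="bestsellers">
<tr><td>Book B</td><td>1</td></tr>
<tr><td>Book C</td><td>2</td></tr>
<tr><td>Book A</td><td>3</td></tr>
</table></div>
</body></html>"""

# Expression for the current selection (the table), as in step 602.
table_expr = ".//table[@id='bestsellers']"

# Expression locating the book inside the table treated as a new
# document, with the title as the filtering criterion (step 610).
book_expr = "./tr[td='Book B']"

# Append the inner expression to the outer one (step 614); the leading
# "." of the inner expression is dropped before concatenation.
final_expr = table_expr + book_expr[1:]


def locate(page_text, expression):
    """Parse a page and return the first element matching `expression`."""
    return ET.fromstring(page_text).find(expression)


for page in (WEEK_1, WEEK_2):
    row = locate(page, final_expr)
    print([td.text for td in row])
```

Because neither part of the concatenated expression depends on a fixed
position, the same final expression works for both snapshots.
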
One example for converting the template into a stylesheet is now
described with reference to FIG. 7. As mentioned above, embodiments of the
present invention contemplate that compiler 108 may be used to convert a
template into a stylesheet via, for example, a two pass process. Although the
example depicted in FIG. 7 illustrates the conversion of a template into an
XSLT stylesheet, as mentioned above, other stylesheets may also be produced.
In addition, single pass and other multiple pass processes may also be
utilized. Referring to FIG. 7, the first pass is responsible for creating a main
body of the stylesheet. During this pass, any custom tags (step 702) may be
replaced with equivalent XSLT syntax (step 710). Non-custom tags, on the
other hand, are copied directly onto the stylesheet (step 706). This process is
repeated until each tag in the template has been evaluated (step 714).

The second pass allows the custom tags to create any additional XSLT
syntax that is required outside of the stylesheet's main body. For example,
this may be required for custom tags that use the XSLT command
xsl:apply-templates. The templates used by this and other similar commands are
typically located outside the main body and may be created at this time, if
necessary (step 722). Non-custom tags are generally applicable to the main
body and may therefore be ignored during this step. Again, this process is
repeated until each tag in the template has been evaluated (step 726).
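
A minimal Python sketch of such a two pass conversion is given below. The
custom tag names (mine:value, mine:apply), the flat list representation of
the template, the out wrapper element, and the function names are all
hypothetical and chosen only for illustration; the actual tag vocabulary of
compiler 108 is not specified here.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Template as a flat list of (tag name, payload). Custom tags carry the
# hypothetical "mine:" prefix and a site mining expression as payload;
# all other tags carry literal text.
TEMPLATE = [
    ("h1", "Best sellers"),
    ("mine:value", "/html/body/h1[1]"),
    ("mine:apply", "/html/body/table[1]/tr"),
]


def first_pass(template):
    """Build the main body: translate custom tags, copy the rest."""
    body = []
    for name, payload in template:
        if name == "mine:value":
            body.append(f'<xsl:value-of select="{payload}"/>')
        elif name == "mine:apply":
            body.append(f'<xsl:apply-templates select="{payload}"/>')
        else:
            # Non-custom tag: copied directly onto the stylesheet.
            body.append(f"<{name}>{payload}</{name}>")
    return body


def second_pass(template):
    """Emit syntax that must live outside the main body."""
    extra = []
    for name, payload in template:
        # Only tags that use xsl:apply-templates need a matching template;
        # every other tag is ignored during this pass.
        if name == "mine:apply":
            extra.append(
                f'<xsl:template match="{payload}">'
                '<xsl:copy-of select="."/></xsl:template>'
            )
    return extra


def compile_template(template):
    """Run both passes and assemble the stylesheet text."""
    main_body = "".join(first_pass(template))
    extra = "".join(second_pass(template))
    return (
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">'
        f'<xsl:template match="/"><out>{main_body}</out></xsl:template>'
        f"{extra}</xsl:stylesheet>"
    )


stylesheet = compile_template(TEMPLATE)
ET.fromstring(stylesheet)  # raises ParseError if the result is not well-formed
print(stylesheet)
```

The sketch only checks that the generated stylesheet is well-formed XML;
applying it to a page would require an XSLT processor, which is outside the
Python standard library.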

The techniques of the present invention may be implemented on a
computing unit such as that depicted in FIG. 8. In this regard, FIG. 8 is an
illustration of a computer system which is also capable of implementing some
or all of the computer processing in accordance with computer implemented
embodiments of the present invention. The procedures described herein are
presented in terms of program procedures executed on, for example, a
computer or network of computers.

Viewed externally in FIG. 8, a computer system designated by
reference numeral 1100 has a computer portion 1102 having disk drives 1104
and 1106. Disk drive indications 1104 and 1106 are merely symbolic of a
number of disk drives which might be accommodated by the computer system.
Typically, these would include a floppy disk drive 1104, a hard disk drive (not
shown externally) and a CD ROM indicated by slot 1106. The number and
type of drives vary, typically with different computer configurations. Disk
drives 1104 and 1106 are in fact optional, and for space considerations, are
easily omitted from the computer system used in conjunction with the
production process/apparatus described herein.

The computer system also has an optional display 1108 upon which
information may be displayed. In some situations, a keyboard 1110 and a
mouse 1112 are provided as input devices through which input may be
provided, thus allowing input to interface with the central processing unit
1102. Alternatively, for enhanced portability, the keyboard 1110 is either a
limited function keyboard or omitted in its entirety. In addition, mouse 1112
optionally is a touch pad control device, or a track ball device, or even omitted
in its entirety as well, and similarly may be used as an input device. In
addition, the computer system 1100 may also optionally include at least one
infrared (or radio) transmitter and/or infrared (or radio) receiver for either
transmitting and/or receiving infrared signals.

Although computer system 1100 is illustrated having a single
processor, a single hard disk drive and a single local memory, the system 1100
is optionally suitably equipped with any multitude or combination of
processors or storage devices. Computer system 1100 may be replaced by, or
combined with, any suitable processing system operative in accordance with
the principles of the present invention, including hand-held, laptop/notebook,
mini, mainframe and super computers, as well as processing system network
combinations of the same.

FIG. 9 illustrates a block diagram of exemplary internal hardware of
the computer system 1100 of FIG. 8. A bus 1202 serves as the main
information highway interconnecting the other components of the computer
system 1100. CPU 1204 is the central processing unit of the system,
performing calculations and logic operations required to execute a program.
Read only memory (ROM) 1206 and random access memory (RAM) 1208
constitute the main memory of the computer 1102. Disk controller 1210
interfaces one or more disk drives to the system bus 1202. These disk drives
are, for example, floppy disk drives such as 1104 or 1106, or CD ROM or
DVD (digital video disk) drives such as 1212, or internal or external hard
drives 1214. As indicated previously, these various disk drives and disk
controllers are optional devices.

A display interface 1218 interfaces display 1108 and permits
information from the bus 1202 to be displayed on the display 1108. Again as
indicated, display 1108 is also an optional accessory. For example, display
1108 could be substituted or omitted. Communications with external devices,
for example, the other components of the system described herein, occur
utilizing communication port 1216. For example, optical fibers and/or
electrical cables and/or conductors and/or optical communication (e.g.,
infrared, and the like) and/or wireless communication (e.g., radio frequency
(RF), and the like) can be used as the transport medium between the external
devices and communication port 1216. Peripheral interface 1220 interfaces the
keyboard 1110 and the mouse 1112, permitting input data to be transmitted to
the bus 1202.

In alternate embodiments, the above-identified CPU 1204 may be
replaced by or combined with any other suitable processing circuits, including
programmable logic devices, such as PALs (programmable array logic) and
PLAs (programmable logic arrays), DSPs (digital signal processors), FPGAs
(field programmable gate arrays), ASICs (application specific integrated
circuits), VLSIs (very large scale integrated circuits) or the like.

One of the implementations contemplated by embodiments of the
present invention is as sets of instructions resident in the random access
memory 1208 of one or more computer systems 1100 configured generally as
described above and/or as a transmission (e.g., digital signals). Until required
by the computer system, the set of instructions may be stored in another
computer readable memory, for example, in the hard disk drive 1214, or in a
removable memory such as an optical disk for eventual use in the CD-ROM
1212 or in a floppy disk for eventual use in a floppy disk drive 1104, 1106.
Further, the set of instructions (such as those written in the Java programming
language) can be stored in the memory of another computer and transmitted in
a transmission means such as a local area network or a wide area network such
as the Internet 180 when desired by the user. One skilled in the art knows that
storage or transmission of the computer program product changes the medium
electrically, magnetically, or chemically so that the medium carries computer
readable information.

In general, it should be emphasized that the various components of
embodiments of the present invention can be implemented in hardware,
software, or a combination thereof. In such embodiments, the various
components and steps would be implemented in hardware and/or software to
perform the functions of the present invention. Any presently available or
future developed computer software language and/or hardware components
can be employed in such embodiments of the present invention. For example,
at least some of the functionality mentioned above could be implemented
using Java, C, or C++ programming languages.

It is also to be appreciated and understood that the specific 
embodiments of the invention described hereinbefore are merely illustrative of 
the general principles of the invention. Various modifications may be made 
by those skilled in the art consistent with the principles set forth hereinbefore. 

The many features and advantages of the invention are apparent from 
the detailed specification, and thus, it is intended by the appended claims to 
cover all such features and advantages of the invention which fall within the
true spirit and scope of the invention. Further, since numerous modifications 
and variations will readily occur to those skilled in the art, it is not desired to 
limit the invention to the exact construction and operation illustrated and 
described, and accordingly, all suitable modifications and equivalents may be 
resorted to, falling within the scope of the invention. While the foregoing 
invention has been described in detail by way of illustration and example of 
preferred embodiments, numerous modifications, substitutions, and alterations 
are possible without departing from the scope of the invention defined in the 
following claims. 






